Abstract

We  present  a  method  for  estimating  the  shape  of  a  deformable  model  using  the  least-squares 
residuals  from  a  model-based  optical  flow  computation.  This  method  is  built  on  top  of  an 
estimation  framework  using  optical  flow  and  image  features,  where  optical  flow  affects  only 
the  motion  parameters  of  the  model.  Using  the  results  of  this  computation,  our  new  method 
adjusts  all  of  the  parameters  so  that  the  residuals  from  the  flow  computation  are  minimized.  We 
present  face  tracking  experiments  that  demonstrate  that  this  method  obtains  a  better  estimate 
of  shape  compared  to  related  frameworks. 

Index  terms:  non-rigid  shape  and  motion  estimation,  model-based  optical  flow,  deformable 
models 


1  Introduction 


Applications of model-based face tracking require both identifying users and interpreting their actions.
Only by watching faces can conversational interfaces [5], interactive kiosks [24] and robots
[11]  start  to  understand  more  features  of  natural  face-to-face  dialogue.  And  only  by  gathering 
accurate  motion  estimates  of  faces  can  facial  animation  systems  be  automated  [27]  or  videos  of 
faces  be  compressed  effectively  [15,  16,  27].  Even  though  some  of  these  systems  do  not  need  to 
know  about  the  user’s  appearance,  having  an  accurate  estimate  of  the  face  shape  is  still  important, 
as it makes the system more accurate, efficient and robust. This argument can be made for most
domains  which  lend  themselves  to  model-based  vision. 

A  model  of  a  viewed  object  invites  one  to  distinguish  parameters  which  describe  the  object’s 
underlying  and  unchanging  shape  from  those  which  describe  its  motion — temporary  non-rigid 
deformations  away  from  this  shape.  Making  such  a  categorization  is  a  first  step  towards  the  difficult 
problem  of  simultaneously  estimating  shape  and  motion.  Because  the  interpretation  of  shape  and 
motion  from  an  image  is  highly  ambiguous,  it  is  also  important  to  consider  how  parameters  are 
informed  by  the  observations. 

Consider  a  model-based  optical  flow  computation,  for  example.  Observed  motion  can  directly 
constrain  only  the  motion  parameters;  the  true  values  of  the  shape  parameters  do  not  change  over 
time.  However,  inaccurate  shape  estimates  can  make  it  impossible  for  the  model  to  explain  the 
observed  motion  using  the  motion  parameters — this  produces  errors  in  the  perceived  motion.  In 
this  case,  we  can  adjust  the  shape  parameters  to  improve  the  entire  estimate. 

The  challenge  is  to  be  faithful  to  the  distinction  between  shape  and  motion  parameters  as  the 
adjustment  is  computed.  For  example,  it  is  insufficient  to  simply  stage  the  computation,  and  use 
the  leftovers  from  the  motion  estimate  to  feed  a  computation  which  determines  how  the  shape 
parameters  could  have  changed  over  time  to  explain  the  remaining  observed  motion  [15].  This 
method  simply  treats  shape  parameters  as  motion  parameters.  In  this  paper,  we  propose  computing 
an  adjustment  to  the  shape  parameters  that  minimizes  the  error  in  the  motion  estimate.  In  other 
words,  we  determine  a  new  configuration  for  which  the  motion  parameters  would  have  produced 
less  error  in  the  first  place. 

Using  flow  to  estimate  shape  is  the  subject  of  the  large  body  of  work  on  structure  from  motion 
using  an  optical  flow  field,  reviewed  in  [1].  There  has  also  been  a  great  deal  of  work  on  the  structure 
from  motion  problem  using  feature  correspondences,  which  is  surveyed  in  [13].  Applying  these 
techniques  to  tracking  and  estimating  the  shape  of  faces  is  rather  difficult,  as  most  methods  are  not 
suited  to  non-rigid  motion;  there  have  only  been  successful  implementations  of  this  quite  recently 
[9, 14].

This  paper  describes  an  alternative  to  structure  from  motion  methods;  our  method  is  coupled  to 
a  model-based  optical  flow  computation  using  a  deformable  model.  Instead  of  performing  direct 
surface  reconstruction  from  optical  flow,  our  method  indirectly  adapts  model  parameters  so  that 
the  error  in  the  model-based  optical  flow  is  reduced. 

1.1  Separation  of  shape  and  motion 

The  starting  point  for  our  method  is  an  intuitive  distinction  between  shape  and  motion;  to  account 
for  this  distinction,  the  process  of  model  design  must  encode  information  about  a  class  of  objects 
by  categorizing  parameters  as  describing  variation  in  shape  or  motion.  The  shape  parameters  are 
static  quantities  for  a  particular  observed  object,  and  describe  its  unchanging  geometric  features. 
The  motion  parameters  are  dynamic  quantities,  which  change  when  the  observed  object  moves  or 
deforms.  Of  course,  there  is  no  guarantee  that  the  shape  and  motion  of  some  class  of  objects  is 
separable  in  an  arbitrary  parameterization;  this  is  a  simplifying  assumption  that  we  make,  and  will 
only  apply  to  a  certain  degree  of  accuracy.  It  works  quite  well  for  models  of  human  faces,  for 
instance,  where  shape  parameters  describe  an  individual’s  appearance,  while  motion  parameters 
encode  the  location  of  the  head,  as  well  as  facial  displays  and  expressions.  This  division  is  often 
built  into  face  models  [3,  6,  16,  19,  28]  to  simplify  model  construction  or  estimation,  and  has  been 
used  to  facilitate  learning  the  variability  of  motions  for  a  class  of  objects  [25]. 

The  ultimate  goal  of  this  separation  is  to  produce  an  estimation  problem  with  lower  dimension. 
During estimation, the change in the shape parameters should tend to zero as the shape of the
observed object is established. Once this occurs, fitting need only continue for the motion parameters.
Therefore,  during  model  design,  the  separation  into  shape  and  motion  should  encode  as  many  of 
the  model  deformations  with  shape  parameters  as  possible.  This  decision  leads  to  a  more  efficient 
tracking  system. 

For models with separate parameters for shape and motion, certain cues such as optical flow
arise due to motion in the scene, and are appropriately used only for the estimation of motion
parameters (and not shape parameters). While in some cases updating all the parameters (shape and
motion) based on the flow can result in smaller deviations [15], this is missing the point of
separating shape and motion in the first place, and is in conflict with the view of the shape parameters
as  having  static  values. 

1.2  Using  residuals 

A  significant  error  in  the  current  model  estimate  will  interfere  with  the  optical  flow  estimates  of 
the  motion,  since  the  model  and  image  will  be  misaligned.  For  example,  if  the  estimate  of  a  “nose 
protrusion”  parameter  is  inaccurate,  the  pattern  of  motion  that  the  model  would  predict  during  a 
head  turn  would  be  incorrect  in  the  local  region  of  the  nose.  More  specifically,  if  the  estimate  of 
how  much  the  nose  protrudes  is  much  less  than  it  should  be,  then  the  model  would  only  be  able  to 
explain a fraction of the motion information in the image region near the nose. As seen from this
example,  the  interference  is  quite  systematic,  which  makes  it  possible  to  adjust  the  current  shape 
estimate.  Here,  we  would  aim  to  adjust  the  nose  protrusion  parameter  in  a  way  that  reduces  the 
motion  model’s  error  (so  it  better  explains  the  image).  We  will  perform  this  adjustment  using  the 
residual  from  the  optical  flow  constraint  equation. 

From this residual, each pixel used in the optical flow computation supplies one piece of
information which is then used to determine how the parameters can be corrected to minimize the
interference.  However,  some  pixels  will  not  supply  any  useful  information.  And  even  worse,  many 
of  the  pixels  will  include  distracting  information  resulting  from  optical  flow  linearization,  optical 
flow  constraint  violations  (such  as  lighting  changes,  shadows,  or  specularities),  motion  estimation 
errors,  and  noise.  As  a  result,  we  must  be  sure  to  use  a  sufficient  number  of  pixels,  as  well  as  to 
avoid adjusting the parameters based on distracting information. We have to assume that residual
contributions that result from small errors in the estimated shape are significantly larger than
those caused by distracting sources. We confirm this assumption empirically in the context of face
tracking in Section 5.2, and it is likely to apply in other domains where tracking with model-based
optical  flow  is  successful. 

The  work  in  this  paper  builds  on  the  model-based  face  tracking  framework  described  in  [7]. 
This  framework  uses  a  model-based  optical  flow  computation  as  a  constraint  on  the  motion  of  a 
deformable  model,  which  uses  features  (edges)  to  align  the  model  with  the  image.  Using  features 
prevents  the  accumulation  of  tracking  error,  which  would  have  otherwise  been  a  difficulty  using 
flow alone. With our proposed method, changes in the image are initially attributed entirely to
motion, but then the error in the reconstructed motion is used to more accurately extract the parameters
of  the  object  being  tracked. 

After  a  brief  review  of  deformable  models  in  Section  2,  we  discuss  existing  approaches  to 
model-based  optical  flow  in  Section  3.  Section  4  describes  our  method  for  adjusting  the  model 
parameters  using  residuals  from  a  model-based  optical  flow  computation.  Section  5  presents  and 
discusses  experiments  which  demonstrate  how  our  technique  improves  the  shape  estimate  of  a 
tracked  face. 


2  Deformable  models 

Deformable  models  [18,  23,  29]  are  parameterized  shapes  that  deform  due  to  forces  according 
to  physical  laws.  For  vision  applications,  physics  provides  a  useful  analogy  for  treating  shape 
estimation  [18],  where  forces  are  determined  from  visual  cues  such  as  edges  in  an  image.  The 
deformations  that  follow  produce  a  shape  that  agrees  with  the  data. 

The shape of the deformable model x is parameterized by a time-varying vector of values q and
is defined over a domain $\Omega$ which can be used to identify specific points on the model; a particular
point on the model is written as $\mathbf{x}(\mathbf{q};\mathbf{u})$ with $\mathbf{u} \in \Omega$, although the dependency of x on q is often
omitted.  The  goal  of  shape  and  motion  estimation  is  to  recover  the  value  of  q  over  time  from  a 
sequence  of  images.  For  this  paper,  we  will  be  using  the  three-dimensional  parameterized  face 
model from [7].

As stated earlier, to distinguish the processes of shape estimation and motion tracking, the
parameters in q are rearranged and separated into $\mathbf{q}_b$ (the basic shape of the object) and $\mathbf{q}_m$ (rigid
and non-rigid motion), so that $\mathbf{q} = (\mathbf{q}_b^\top, \mathbf{q}_m^\top)^\top$. Within our face model, $\mathbf{q}_b$ describes an individual’s
appearance (there are about 80 shape parameters), while $\mathbf{q}_m$ encodes the location of their head,
as well as their facial displays and expressions (there are 12 motion parameters). Figure 1 shows
examples of the face model undergoing various shape deformations (showing four different
individuals), motion deformations (showing brow raising and frowning, smiling, and mouth opening)
and  finally  two  examples  of  when  several  deformations  are  applied  at  once.  Further  detail  about 
this  model  can  be  found  in  [7]. 

[Figure 1 image not recoverable; surviving panel labels: "motion only", "shape only"]

Figure 1: Example parameterized deformations of the face model (with separate parameters for
shape and motion)

The  model  x  is  formed  by  applying  deformation  functions  to  the  underlying  shape  s.  For  this 
paper,  the  underlying  face  model  s  is  a  polygon  mesh  (shown  in  the  center  of  Figure  1).  There 
are separate deformation functions for shape ($T_b$) and for motion ($T_m$). The shape deformation is
applied first, so that:

$$\mathbf{x}(\mathbf{q};\mathbf{u}) = T_m\big(\mathbf{q}_m;\, T_b(\mathbf{q}_b;\, \mathbf{s}(\mathbf{u}))\big) \tag{1}$$

The shape deformation uses the parameters $\mathbf{q}_b$ to deform the underlying shape s. On top of this
is the motion deformation $T_m$ with parameters $\mathbf{q}_m$, which includes a rigid translation and rotation
(head motion), as well as non-rigid deformations (facial expressions and displays).

When  modeling  a  three-dimensional  object  viewed  in  images,  x  includes  a  camera  projection, 
resulting  in  a  two-dimensional  model  called  xp,  which  is  projected  flat  from  the  original  three- 
dimensional  model. 
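
As a rough illustration of the composition in (1) followed by projection, the sketch below builds a toy model. The deformation functions (per-axis scaling for shape, a rotation about the y axis plus a translation for motion), the pinhole projection, and all function names are assumptions made for this example; they are not the face model of [7].

```python
import numpy as np

def shape_deform(q_b, s):
    """Toy T_b: scale the base points s (N x 3) per axis by q_b (3 values)."""
    return s * q_b

def motion_deform(q_m, pts):
    """Toy T_m: rotate by angle q_m[0] about the y axis, then translate by q_m[1:4]."""
    c, sn = np.cos(q_m[0]), np.sin(q_m[0])
    R = np.array([[c, 0.0, sn],
                  [0.0, 1.0, 0.0],
                  [-sn, 0.0, c]])
    return pts @ R.T + q_m[1:4]

def model_points(q_b, q_m, s):
    """x(q; u) = T_m(q_m; T_b(q_b; s(u))): shape is applied first, then motion."""
    return motion_deform(q_m, shape_deform(q_b, s))

def project(pts, focal=1.0):
    """Assumed pinhole camera: x_p = focal * (X / Z, Y / Z)."""
    return focal * pts[:, :2] / pts[:, 2:3]

s = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])  # base shape
q_b = np.array([2.0, 1.0, 1.0])        # static shape parameters (hypothetical)
q_m = np.array([0.0, 0.0, 0.0, 5.0])   # no rotation, translation of 5 along z
x = model_points(q_b, q_m, s)
x_p = project(x)
```

Changing `q_b` alters the object itself, while changing `q_m` only moves it; this is the separation the rest of the method relies on.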

2.1  Kinematics  and  dynamics 

The kinematics of the model are determined in terms of the parameter velocities $\dot{\mathbf{q}}$. As the shape
changes, the velocity at a point u on the model is given by:

$$\dot{\mathbf{x}}(\mathbf{u}) = \mathbf{L}(\mathbf{q};\mathbf{u})\,\dot{\mathbf{q}} \tag{2}$$

where $\mathbf{L} = \partial\mathbf{x}/\partial\mathbf{q}$ is the model Jacobian [18]. For conciseness, the dependency of L on
q is often omitted.

We view L as consisting of components that correspond to $\mathbf{q}_b$ and $\mathbf{q}_m$, so that it can be written
as $[\mathbf{L}_b\ \mathbf{L}_m]$. The Jacobian of $\mathbf{x}_p$ (the projected model) is written as $\mathbf{L}_p$, and is decomposed into
components for $\mathbf{q}_b$ and $\mathbf{q}_m$ as $[\mathbf{L}_{bp}\ \mathbf{L}_{mp}]$. In Section 3, the image velocities associated with the
optical flow are modeled by $\dot{\mathbf{x}}_{mp}$, which is the projected motion that arises due to changes in the
motion parameters:

$$\dot{\mathbf{x}}_{mp}(\mathbf{u}) = \mathbf{L}_{mp}(\mathbf{q};\mathbf{u})\,\dot{\mathbf{q}}_m \tag{3}$$

The shape parameters are not included, as $\dot{\mathbf{q}}_b$ is not a characteristic of the scene; instead, it reflects
how the shape estimate changes as more images arrive.

The  models  defined  above  are  useful  for  applications  such  as  shape  and  motion  estimation  when 
used  in  a  physics-based  framework  [18].  These  techniques  are  a  form  of  optimization  whereby 
the  deviation  between  the  model  and  the  data  is  minimized.  The  optimization  is  performed  by 
integrating  differential  equations  derived  from  the  Euler-Lagrange  equations  of  motion.  These 
equations are simplified in a standard manner [18], and in this case result in:

$$\dot{\mathbf{q}} = \mathbf{f}_q \tag{4}$$

where the applied forces $\mathbf{f}_q$ are computed from two-dimensional image forces $\mathbf{f}_{\mathrm{image}}$ as:

$$\mathbf{f}_q = \sum_j \mathbf{L}_p(\mathbf{u}_j)^\top\, \mathbf{f}_{\mathrm{image}}(\mathbf{u}_j) \tag{5}$$

The  distribution  of  forces  on  the  model  is  based  in  part  on  forces  computed  from  the  edges  of 
an  input  image  [18].  With  that,  and  given  an  adequate  model  initialization,  these  forces  will  align 
features  on  the  model  with  image  features,  thereby  determining  appropriate  parameter  values.  The 
dynamic  system  in  (4)  is  solved  by  integrating  over  time,  using  standard  (explicit)  differential 
equation  integration  techniques,  such  as  Euler  integration: 

$$\mathbf{q}(t + \Delta t) = \mathbf{q}(t) + \dot{\mathbf{q}}(t)\,\Delta t \tag{6}$$

When  implementing  this  framework  using  a  Kalman  filter  [7],  (6)  becomes  the  discrete  update 
equation  for  the  state.  The  initialization  which  specifies  the  value  of  q(0)  is  described  in  Section  5. 
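
The force accumulation in (5) and the Euler update in (6) can be sketched on a toy problem. Everything specific here is an assumption for illustration: the model is a square of four 2-D points with one scale and two translation parameters, the image forces are simply the offsets to known target points (standing in for edge forces), and no Kalman filter is involved.

```python
import numpy as np

s = np.array([[-1.0, -1.0], [1.0, -1.0], [1.0, 1.0], [-1.0, 1.0]])  # model points s(u_j)

def model(q):
    """Toy projected model: x_p(u_j) = q[0] * s_j + (q[1], q[2])."""
    return q[0] * s + q[1:3]

def jacobian(q):
    """L_p(u_j) = d x_p / d q for every point, shape (N, 2, 3)."""
    L = np.zeros((len(s), 2, 3))
    L[:, :, 0] = s        # derivative with respect to the scale q[0]
    L[:, 0, 1] = 1.0      # derivative with respect to the x translation
    L[:, 1, 2] = 1.0      # derivative with respect to the y translation
    return L

q_true = np.array([1.5, 0.3, -0.2])
targets = model(q_true)           # stand-in for image features such as edges

q = np.array([1.0, 0.0, 0.0])     # initialization q(0)
dt = 0.1
for _ in range(100):
    f_image = targets - model(q)                          # 2-D force on each point
    f_q = np.einsum('jik,ji->k', jacobian(q), f_image)    # (5): sum_j L_p^T f_image
    q = q + f_q * dt                                      # (6): Euler step, q_dot = f_q
```

With this step size the iteration settles on `q_true`; an explicit Euler scheme like this one can diverge if `dt` is chosen too large for the problem.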

3  Model-based  optical  flow 

The  optical  flow  is  typically  defined  as  the  apparent  motion  of  brightness  patterns  across  an  image 
[12]. Attempting to use this information in applications such as object tracking requires
assumptions about the objects (or scene) being viewed. Most common is the assumption that particular
locations on viewed objects do not change in brightness. This brightness constancy assumption
leads to the formulation of the well-known optical flow constraint equation at a pixel i in the image
I:

$$\nabla I_i \begin{bmatrix} u_i \\ v_i \end{bmatrix} + I_{t_i} = 0 \tag{7}$$

where $\nabla I = [I_x\ I_y]$ are the spatial derivatives and $I_t$ is the temporal derivative of the image intensity.
$u_i$ and $v_i$ are the components of the image velocities at pixel i.

The model-based optical flow constraint equation is a reformulation of (7) in terms of a model’s
motion parameters $\mathbf{q}_m$. When viewing a model under projection, there exists a unique model point
$\mathbf{u}_i \in \Omega$ which corresponds to a particular pixel (except on occluding boundaries and situations
involving transparency). In a model-based approach, the image velocities $u_i$ and $v_i$ are specified by
projected velocities of points on the model $\dot{\mathbf{x}}_{mp}(\mathbf{u}_i)$ given by (3):

$$\dot{\mathbf{x}}_{mp}(\mathbf{u}_i) = \mathbf{L}_{mp}(\mathbf{u}_i)\,\dot{\mathbf{q}}_m \tag{8}$$

Note that only the changes resulting from the motion parameters $\mathbf{q}_m$ are included, as optical flow
velocities do not reflect changes in the shape parameters $\mathbf{q}_b$. The model-based optical flow
constraint equation is developed by rewriting (7) using (8):

$$\nabla I_i\, \mathbf{L}_{mp}(\mathbf{u}_i)\,\dot{\mathbf{q}}_m + I_{t_i} = 0 \tag{9}$$

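When considered over a set of n pixels, a stacked set of instances of (9) can be written in matrix
form as:

$$\begin{bmatrix} \nabla I_1\, \mathbf{L}_{mp}(\mathbf{u}_1) \\ \vdots \\ \nabla I_n\, \mathbf{L}_{mp}(\mathbf{u}_n) \end{bmatrix} \dot{\mathbf{q}}_m + \begin{bmatrix} I_{t_1} \\ \vdots \\ I_{t_n} \end{bmatrix} = \mathbf{0} \tag{10}$$

which can be written compactly as

$$\mathbf{B}\,\dot{\mathbf{q}}_m + \mathbf{I}_t = \mathbf{0} \tag{11}$$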

Formulations similar to (11) (although superficially appearing quite different) can be found in
[2, 15, 16, 21, 22].

An estimate of $\dot{\mathbf{q}}_m$ (written as $\hat{\dot{\mathbf{q}}}_m$) using (11) is determined by solving a least-squares problem:

$$\hat{\dot{\mathbf{q}}}_m = \arg\min_{\dot{\mathbf{q}}_m} \left\| \mathbf{B}\,\dot{\mathbf{q}}_m + \mathbf{I}_t \right\|^2 \tag{12}$$

Iterative approaches to solving this problem using techniques such as the Gauss-Newton method
[10] are taken in [2, 15, 21, 22]. The solution in [7] performs a single step using the pseudo-inverse
(where $\mathbf{B}^+$ is the pseudo-inverse of B [26]) [22, 16]:

$$\hat{\dot{\mathbf{q}}}_m = -\mathbf{B}^+ \mathbf{I}_t \tag{13}$$

This is the linear least-squares solution; it linearizes by assuming $\mathbf{L}_{mp}$ is constant (ignoring its
dependency on q). This simple solution is sufficient, as it is being combined with the result of an
iterative template-alignment problem (using edges), which yields a system with higher accuracy
and more robust behavior. Typical problems encountered in flow computation are avoided by
strategic selection of pixels for use in (11) [7].

The derivation of (7) involves the truncation of a Taylor series, and as a result requires
relatively small motions between frames. To address problems with estimating larger motions (or to
implement  coarse-to-fine  methods),  some  iterative  approaches  transform  the  model  geometry  at 
each  iteration  using  the  previous  motion  estimate  [15],  while  others  undo  the  previous  estimate  by 
warping the input images [2]. The system in [7] assumes small motions (although of course could
be  extended  using  the  above  methods). 

The  most  serious  difficulty  for  these  techniques,  however,  is  combating  tracking  drift.  Using 
only  velocity  information,  small  estimation  errors  accumulate  over  time.  The  solution  to  this 
problem  is  to  include  other  information  (such  as  features  or  edges)  to  prevent  errors  from  building 
up [7, 15, 16].

3.1  Optical  flow  residuals 

Unlike image-based optical flow techniques, these model-based methods do not require
assumptions about the smoothness of the flow field to determine a solution, as the number of pixels
providing useful information is sufficiently greater than the number of motion parameters. Of course,
now that the solution is over-determined, there will be a residual from the least-squares solution,
given the estimate $\hat{\dot{\mathbf{q}}}_m$:

$$\mathbf{r} = \mathbf{B}\,\hat{\dot{\mathbf{q}}}_m + \mathbf{I}_t \tag{14}$$

The  residual  r  is  a  vector  having  dimension  n  (the  number  of  pixels  used  in  the  flow  computation). 
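
The single pseudo-inverse step (13) and the residual (14) can be sketched numerically as follows. The image gradients and per-pixel Jacobians are random stand-ins, and the sizes, noise level, and variable names are assumptions chosen for the example rather than values from the face tracker.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 3                            # pixels, motion parameters (toy sizes)

grad_I = rng.normal(size=(n, 2))         # rows are the gradients [I_x, I_y] per pixel
L_mp = rng.normal(size=(n, 2, m))        # L_mp(u_i): projected motion Jacobian per pixel

# Row i of B is grad_I_i @ L_mp(u_i), as in the stacked system (10).
B = np.einsum('ni,nim->nm', grad_I, L_mp)

qdot_true = np.array([0.2, -0.1, 0.05])
I_t = -B @ qdot_true + 0.01 * rng.normal(size=n)   # temporal derivatives plus noise

qdot_hat = -np.linalg.pinv(B) @ I_t      # (13): single pseudo-inverse step
r = B @ qdot_hat + I_t                   # (14): one residual entry per pixel
```

Because `qdot_hat` is the least-squares solution, `r` is orthogonal to the columns of `B`: no further change to the motion parameters alone can reduce it, which is why the remaining residual is informative about other sources of error.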

Contributions to the residual come from many sources. Aside from measurement noise, the most
obvious are linearization errors that result from ignoring the higher-order terms in (7) (which
were truncated in the Taylor series) and from the use of a linear least-squares solution. Other
contributing factors include violations of the brightness constancy assumption such as lighting
changes, shadows, and specularities. Finally, shape and motion estimation errors (deviation
between the current estimate of q and its actual value) will prevent the model from properly aligning
with the image, and will cause a sizable increase in the residual.

We  claim  that  if  significant  errors  are  present  in  the  estimated  shape  and  motion,  they  will  be 
the  primary  contributors  to  the  residual.  We  support  this  claim  empirically  in  Section  5.  This 
means  that  the  residual  is  a  valuable  piece  of  information  that  can  be  used  for  estimating  shape 
and  motion,  allowing  us  to  compute  small  adjustments  to  the  shape  and  motion  parameters  which 
reduce  the  residual,  as  we  see  in  the  next  section. 

4  Adjusting  parameters  using  residuals 

There  are  many  approaches  which  use  residuals  to  improve  the  fit  of  the  model;  we  will  describe 
three in this section. Each of these approaches involves the minimization of a distinct least-squares
problem. The differences between approaches are best explained by how they respect the
separation of shape and motion parameters in the model. The last approach we describe here is the main
contribution  of  this  paper — and  it  is  also  the  only  method  that  truly  respects  the  separation  between 
shape  and  motion  parameters. 

The  naive  approach  treats  all  of  the  model  parameters  as  motion  parameters.  This  results  in  the 
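following model-based optical flow constraint equation, as an alternative to (11):

$$\begin{bmatrix} \mathbf{B}_b & \mathbf{B} \end{bmatrix} \begin{bmatrix} \dot{\mathbf{q}}_b \\ \dot{\mathbf{q}}_m \end{bmatrix} + \mathbf{I}_t = \mathbf{B}_b\,\dot{\mathbf{q}}_b + \mathbf{B}\,\dot{\mathbf{q}}_m + \mathbf{I}_t = \mathbf{0} \tag{15}$$

where the construction of $\mathbf{B}_b$ is analogous to B, but uses $\mathbf{L}_{bp}$ instead of $\mathbf{L}_{mp}$. As all parameters in q
are treated as motion parameters, one can solve for $\dot{\mathbf{q}}$ using the following:

$$\min_{\dot{\mathbf{q}}} \left\| \mathbf{B}_b\,\dot{\mathbf{q}}_b + \mathbf{B}\,\dot{\mathbf{q}}_m + \mathbf{I}_t \right\|^2 \tag{16}$$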


This  does  not  respect  the  separation  of  the  parameters  into  shape  and  motion  as  it  conflicts  with 
the static interpretation of the shape parameters. This means it can produce shape estimates that
are in disagreement with the static shape of the viewed object. In practice, the treatment of all
parameters as dynamic makes this method more computationally expensive and quite fragile. Further
discussion  on  this  point  is  made  in  Section  5. 

Another  possible  approach  explains  the  residual  as  directly  resulting  from  shape  deviation.  In 
other  words,  the  leftover  motion  not  accounted  for  in  qm  is  used  to  update  the  shape  with  the 
same  formulation  as  for  determining  motion  (from  Section  3).  This  is  realized  by  the  following 
minimization: 


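$$\min_{\dot{\mathbf{q}}_b} \left\| \mathbf{B}_b\,\dot{\mathbf{q}}_b + \mathbf{B}\,\hat{\dot{\mathbf{q}}}_m + \mathbf{I}_t \right\|^2 = \min_{\dot{\mathbf{q}}_b} \left\| \mathbf{B}_b\,\dot{\mathbf{q}}_b + \mathbf{r} \right\|^2 \quad (\text{given } \hat{\dot{\mathbf{q}}}_m) \tag{17}$$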


Instead  of  solving  one  large  system,  the  minimization  in  (16)  is  split,  and  is  solved  for  motion 
first  by  (12),  and  then  for  shape  in  terms  of  the  residual  r  by  (17).  This  method  is  related  to 
one  described  by  Koch  [15]  where  shape  parameters  are  actually  updated  in  two  steps;  first  using 
discrepancies  parallel  to  the  line  of  sight,  then  those  that  are  perpendicular  to  the  line  of  sight. 

This  is  a  reasonable  approach  in  the  context  of  image-coding  [15],  where  image  fidelity  is  of 
much  greater  importance  than  the  accuracy  of  the  face  shape  estimate — the  face  shape  is  deformed 
to  account  for  the  tracking  errors  in  motion.  This  produces  a  face  shape  that  results  in  better  image 
fidelity,  but  does  not  necessarily  estimate  the  actual  shape  of  the  subject’s  face  (as  a  result,  the 
estimated covariance of the shape is much larger). Moreover, given our distinction between shape and
motion  parameters,  it  does  not  make  sense  to  adjust  the  shape  parameters  qb  directly  from  observed 
velocities,  since  the  true  value  of  qb  is  a  static  quantity. 

We take a different approach. Instead, we determine the small change in q that effects the
largest reduction in r. Let $\Delta\mathbf{q}$ be the deviation between the current estimate of q and its true value
(not including the motion extracted in $\hat{\dot{\mathbf{q}}}_m$). We can estimate $\Delta\mathbf{q}$ by solving the following:

$$\Delta\mathbf{q} = \arg\min_{\Delta\mathbf{q}} \left\| \mathbf{B}(\mathbf{q} + \Delta\mathbf{q})\,\hat{\dot{\mathbf{q}}}_m + \mathbf{I}_t \right\|^2 \quad (\text{given } \hat{\dot{\mathbf{q}}}_m) \tag{18}$$


How is this different? The estimate $\Delta\mathbf{q}$ will tell us how the shape (and motion) could have
been different to produce a smaller residual in the first place. Returning to our example from the
introduction, finding $\Delta\mathbf{q}$ would tell us that if the nose protruded further, the model’s motion would
have agreed better with the flow information.

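Formally, consider what results if we assume $\Delta\mathbf{q}$ is of sufficiently small magnitude so that the
first-order approximation to $\mathbf{L}_{mp}$ using its Taylor-series expansion is sufficiently accurate:

$$\mathbf{L}_{mp}(\mathbf{u};\, \mathbf{q} + \Delta\mathbf{q}) \approx \mathbf{L}_{mp}(\mathbf{u};\, \mathbf{q}) + \frac{\partial \mathbf{L}_{mp}(\mathbf{u};\, \mathbf{q})}{\partial \mathbf{q}}\,\Delta\mathbf{q} \tag{19}$$

We can now write down another minimization problem that is equivalent to (18) given this
assumption. Combining this approximation of $\mathbf{L}_{mp}$ with the model-based optical flow constraint
equation (9) results in:

$$\nabla I\, \mathbf{L}_{mp}(\mathbf{u})\,\hat{\dot{\mathbf{q}}}_m + \nabla I \left( \frac{\partial \mathbf{L}_{mp}(\mathbf{u})}{\partial \mathbf{q}}\,\Delta\mathbf{q} \right) \hat{\dot{\mathbf{q}}}_m + I_t = 0 \tag{20}$$

where $\partial \mathbf{L}_{mp}/\partial \mathbf{q}$ is part of the model Hessian (a rank 3 tensor). It is written here “curried” with $\Delta\mathbf{q}$
so that the parenthesized sub-expression here is a matrix.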


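When (20) is considered over n pixels from the input image, this results in the system:

$$\mathbf{B}\,\hat{\dot{\mathbf{q}}}_m + (\mathbf{G}\,\hat{\dot{\mathbf{q}}}_m)\,\Delta\mathbf{q} + \mathbf{I}_t = \mathbf{0} \tag{21}$$

where

$$\mathbf{G} = \begin{bmatrix} \left( \nabla I_1\, \dfrac{\partial \mathbf{L}_{mp}(\mathbf{u}_1)}{\partial \mathbf{q}} \right)^{\!\top} \\ \vdots \\ \left( \nabla I_n\, \dfrac{\partial \mathbf{L}_{mp}(\mathbf{u}_n)}{\partial \mathbf{q}} \right)^{\!\top} \end{bmatrix} \tag{22}$$

The subscripts [1 ... n] in the construction of G correspond to a particular row in (21). The
transpositions performed in the construction of G allow it now to be curried with $\hat{\dot{\mathbf{q}}}_m$ (this construction
transposes the second and third indices of the tensor G). We can now rewrite (21) using the residual
(14):

$$(\mathbf{G}\,\hat{\dot{\mathbf{q}}}_m)\,\Delta\mathbf{q} + \mathbf{r} = \mathbf{0} \tag{23}$$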

which corresponds to solving the following minimization:

$$\Delta\mathbf{q} = \arg\min_{\Delta\mathbf{q}} \left\| (\mathbf{G}\,\hat{\dot{\mathbf{q}}}_m)\,\Delta\mathbf{q} + \mathbf{r} \right\|^2 \tag{24}$$

Solving this least-squares problem determines the best set of small changes in $\mathbf{q}_b$ and $\mathbf{q}_m$ that
minimize the optical flow residual (14), given the linearization of $\mathbf{L}_{mp}$ in (19). In practice, we
solve this using the corrected Gauss-Newton method, which uses the singular value decomposition
and therefore performs well in cases where $(\mathbf{G}\,\hat{\dot{\mathbf{q}}}_m)$ is ill-conditioned. In fact, the linearization in (19) is the
same assumption made to justify use of the Gauss-Newton method, so we are really still solving
(18). Furthermore, the use of the corrected Gauss-Newton method here makes this assumption a
safe one, in terms of convergence.

The intuition for this analysis, that we are solving a minimization that is faithful to the
distinction between shape and motion parameters, is realized here in the formulation of a minimization
problem  very  different  from  (17).  Note  that  it  is  possible  that  the  new  value  of  r  would  be  smaller 
if  (17)  is  used  over  (18)  to  estimate  qm,  but  this  goes  against  the  assumption  that  qb  are  static 
parameters,  and  would  result  in  an  inappropriate  estimate. 
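
To make the adjustment step concrete, the sketch below solves the linearized problem (24) on synthetic data. It is a toy under stated assumptions: the Jacobian is taken to be affine in q so that the linearization (19) is exact, the motion estimate is taken as given, the data are noise-free, and NumPy's SVD-based `lstsq` stands in for the corrected Gauss-Newton method used in the paper. All array names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, p = 300, 3, 5                      # pixels, motion parameters, all parameters

grad_I = rng.normal(size=(n, 2))         # image gradient at each pixel
L0 = rng.normal(size=(n, 2, m))          # L_mp(u_i) at q = 0
dL = rng.normal(size=(n, 2, m, p))       # dL_mp(u_i)/dq, constant in this toy model

def B_of(q):
    """Stack grad_I_i @ L_mp(u_i; q) over pixels; L_mp is affine in q here."""
    L = L0 + dL @ q                      # shape (n, 2, m)
    return np.einsum('ni,nim->nm', grad_I, L)

q_true = rng.normal(size=p)
dq_true = 0.05 * rng.normal(size=p)
q_est = q_true - dq_true                 # current, slightly wrong estimate
qdot_m = np.array([0.3, -0.2, 0.1])      # motion estimate, taken as given

I_t = -B_of(q_true) @ qdot_m             # noise-free temporal derivatives
r = B_of(q_est) @ qdot_m + I_t           # residual (14) at the current estimate

G = np.einsum('ni,nimp->nmp', grad_I, dL)    # tensor G, indexed (pixel, motion, all)
A = np.einsum('nmp,m->np', G, qdot_m)        # the n x p matrix (G qdot_m)

# SVD-based least squares; small singular values are cut off by rcond.
dq, *_ = np.linalg.lstsq(A, -r, rcond=1e-10)
```

In this idealized setting `dq` recovers `dq_true`. Note that if `qdot_m` is zero then `A` is zero and nothing can be learned, which matches the observation that parameters cannot be adjusted where no motion was extracted.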

4.1 Updating the solution


The  framework  in  [7]  provides  us  with  a  filtered  estimate  of  q  (using  both  the  flow  and  edge 
information).  The  dotted  region  in  Figure  2  shows  this  framework.  The  method  from  the  previous 
section  is  also  shown  in  this  diagram  as  the  block  “adjustment  using  residuals”.  We  can  now  use 
the adjustment $\Delta\mathbf{q}$ by statistically combining it with the filtered solution from [7].


[Figure 2 image not recoverable; surviving label: q(t+Δt)]


Figure  2:  Schematic  description  of  overall  system 


The framework in [7] uses a Kalman filter, which provides us with the covariance estimate
$\Lambda_{\dot{q}}$. Having uncertainty information for $\Delta\mathbf{q}$ will allow us to statistically combine these solutions,
as shown towards the right in Figure 2. We model the distracting sources in r (aside from $\Delta\mathbf{q}$) as
zero-mean Gaussian disturbances with covariance $\Lambda_r = \sigma_r^2 \mathbf{I}$. Using (23), the (inverse) covariance
of $\Delta\mathbf{q}$ is:

$$\Lambda_{\Delta q}^{-1} = (\mathbf{G}\,\hat{\dot{\mathbf{q}}}_m)^\top\, \Lambda_r^{-1}\, (\mathbf{G}\,\hat{\dot{\mathbf{q}}}_m) \tag{25}$$

We use $\sigma_r$ to represent the contributions to r from sources other than shape and motion
estimation errors; it is determined in Section 5.2 from experiments where $\mathbf{q}_b$ (the shape) is known in
advance.

The statistical combination of these solutions allows the system to take into account the
uncertainty in each. More specifically, it provides a principled means for the system to ignore $\Delta\mathbf{q}$ in
situations when it is likely to be contaminated with background distractions. Using the covariances
$\Lambda_{\Delta q}$ and $\Lambda_{\dot{q}\Delta t} = \Delta t^2 \Lambda_{\dot{q}}$, the new value of q is found by rewriting (23) using the typical means to
combine Gaussians [4, 8], which weights $\dot{\mathbf{q}}\,\Delta t$ and $\Delta\mathbf{q}$ together based on their uncertainties:

$$\mathbf{q}(t + \Delta t) = \mathbf{q}(t) + \left[ \Lambda_{\dot{q}\Delta t}^{-1} + \Lambda_{\Delta q}^{-1} \right]^{-1} \left[ \Lambda_{\dot{q}\Delta t}^{-1}\, \dot{\mathbf{q}}(t)\,\Delta t + \Lambda_{\Delta q}^{-1}\, \Delta\mathbf{q}(t) \right] \tag{26}$$


This combination assumes that $\Delta q$ is conditionally independent of $\dot q$ given the images. This is a good approximation for our implementation because we use only the shape estimates in $\Delta q$. This eliminates any overlap because $\dot q_m$ does not inform the shape estimate. Experiments showed there was little benefit in using the revised motion estimate in $\Delta q$, as the edge solution was already quite accurate and stable.
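
The combination of the two estimates can be illustrated with a short numerical sketch. This is not the implementation used in the paper: the function names, the use of NumPy, and the stand-in matrix `J` (playing the role of the n x q matrix in (25)) are assumptions made for illustration only.

```python
import numpy as np

def information_from_jacobian(J, sigma_r):
    """Inverse covariance of a least-squares adjustment, in the form of (25).

    J is a hypothetical (n, p) stand-in for the linearized system matrix;
    sigma_r is the standard deviation of the distracting residual component.
    """
    J = np.asarray(J, dtype=float)
    return (J.T @ J) / sigma_r**2

def fuse_updates(q, qdot, dt, cov_qdot, delta_q, info_delta_q):
    """Information-weighted combination of two parameter updates, as in (26).

    qdot * dt is the filtered step with covariance dt^2 * cov_qdot;
    delta_q is the adjustment with inverse covariance info_delta_q.
    """
    info_step = np.linalg.inv(dt**2 * cov_qdot)
    total_info = info_step + info_delta_q
    rhs = info_step @ (qdot * dt) + info_delta_q @ delta_q
    return q + np.linalg.solve(total_info, rhs)

# Toy usage with two parameters (all values are made up).
q0 = np.zeros(2)
step = np.array([1.0, 0.0])        # filtered velocity, with dt = 1
adjust = np.array([0.0, 1.0])      # adjustment from the residuals
fused = fuse_updates(q0, step, 1.0, np.eye(2), adjust, np.eye(2))
# With zero information in the adjustment, it is ignored entirely.
ignored = fuse_updates(q0, step, 1.0, np.eye(2), adjust, np.zeros((2, 2)))
```

When both updates carry equal information the result is their average; as the information in the adjustment goes to zero, the result falls back to the filtered step, which is the behavior that lets the system disregard a contaminated adjustment.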


4.2  Implementation 

Solving (18) is made more efficient by omitting parameters in the construction of $G$ which cannot be affected based on $\dot q_m$. For example, if there is no motion extracted on the forehead, then there is no reason to include eyebrow shape parameters in $G$. Another example is when the extent of a parameter is simply not visible in the image. Whenever there is any motion at all, typically about half of the shape parameters of the face model can be excluded from the computations. The derivatives in $G$ can be computed either analytically or numerically. We use the efficient and modular approach for analytical evaluation of these derivatives as described in [7].
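
As a rough illustration of excluding parameters, the sketch below drops the columns of a linearized system that cannot influence any residual and solves the reduced least-squares problem with a singular value decomposition. The matrix `J`, the function name `solve_active`, and the generic form `J @ dq ~= r` are assumptions for illustration; they do not reproduce the exact system in (18).

```python
import numpy as np

def solve_active(J, r, tol=1e-12):
    """Least-squares solve of J @ dq ~= r, skipping parameters whose columns are zero.

    J: (n, p) array standing in for the linearized system (an assumption).
    r: (n,) residual vector.
    Returns (dq, active); dq holds zeros for the excluded parameters.
    """
    J = np.asarray(J, dtype=float)
    r = np.asarray(r, dtype=float)
    # A parameter that cannot change any residual has an all-zero column.
    active = np.linalg.norm(J, axis=0) > tol
    dq = np.zeros(J.shape[1])
    if not active.any():
        return dq, active
    # Thin SVD of the reduced matrix; negligible singular values are dropped.
    U, s, Vt = np.linalg.svd(J[:, active], full_matrices=False)
    keep = s > tol * s.max()
    dq[active] = Vt[keep].T @ ((U[:, keep].T @ r) / s[keep])
    return dq, active

# Toy usage: 120 residuals and 6 parameters, two of which have no effect.
rng = np.random.default_rng(0)
J_demo = rng.normal(size=(120, 6))
J_demo[:, [2, 4]] = 0.0
dq_true = np.array([0.3, -0.1, 0.0, 0.2, 0.0, 0.05])
dq_est, active_mask = solve_active(J_demo, J_demo @ dq_true)
```

Because the cost of the decomposition grows with the square of the number of columns, removing half of the parameters in this way reduces that part of the work to roughly a quarter.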

The process of determining $\Delta q$ can also be iterated, solving (12) and (18) repeatedly to obtain a greater improvement. For the applications here, only a single iteration is performed (the expense of further iterations is not justified, given the small benefit they typically yield). When we do iterate this process, the algorithm does indeed converge reliably. In other applications (using a model with greater non-linearity, for instance), more stable least-squares methods could be employed (such as Levenberg-Marquardt [10]) that force convergence (although perhaps not to the global minimum).
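
The idea of iterating a linearized adjustment can be sketched on a generic toy problem. The code below is not the iteration over (12) and (18); `residual_fn`, `jacobian_fn`, and the fixed `damping` term (a simplified Levenberg-Marquardt-style regularizer without an adaptive schedule) are hypothetical names and choices made for illustration.

```python
import numpy as np

def iterate_adjustment(residual_fn, jacobian_fn, q0, iterations=1, damping=0.0):
    """Repeat a linearized least-squares adjustment of the parameters q.

    Each pass solves (J^T J + damping * I) dq = J^T r and subtracts dq.
    damping = 0 gives a plain Gauss-Newton step.
    """
    q = np.asarray(q0, dtype=float).copy()
    for _ in range(iterations):
        r = residual_fn(q)
        J = jacobian_fn(q)
        A = J.T @ J + damping * np.eye(q.size)
        q = q - np.linalg.solve(A, J.T @ r)
    return q

# Toy residuals with a mild non-linearity; the minimum is at q = (2, 1).
def toy_residual(q):
    return np.array([q[0] ** 2 - 4.0, q[1] - 1.0])

def toy_jacobian(q):
    return np.array([[2.0 * q[0], 0.0], [0.0, 1.0]])

q_start = np.array([3.0, 0.0])
q_one = iterate_adjustment(toy_residual, toy_jacobian, q_start, iterations=1)
q_many = iterate_adjustment(toy_residual, toy_jacobian, q_start, iterations=10)
```

In this toy case a single pass already removes most of the residual, and further passes yield progressively smaller gains, which mirrors the trade-off described above.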

Virtually all of the computational expense involved in using the method described in this paper results from the singular value decomposition of the matrix $(G\dot q_m)$. This matrix is $n \times q$, where $q$ is the number of parameters used in the computation; the complexity of the SVD in this case is $O(nq^2)$.


5  Experiments 

This section describes a number of face tracking experiments on two representative image sequences. We demonstrate how the adjustment method using residuals is a significant improvement over the framework in [7] and one that uses (17). We also justify our assumption that parameter estimation errors are the leading contributor to the residuals. For the remainder of this section, we will be comparing three frameworks. The first is the original framework from [7], which uses optical flow and edges. The second is the two-stage framework based on (17), using the modification by Koch [15] described in Section 4, which starts with discrepancies parallel to the viewing direction and then follows with those that are perpendicular. The third is the novel residual reduction method presented in this paper, which uses (18). The second and third methods are built using the framework of [7] as in Figure 2, where the result is statistically combined with the original solution from [7]. We were unable to compare a method based on (16), which treats all parameters as dynamic: it was too unstable (causing the system to lose track) to provide meaningful results.

The  image  sequences  are  8  bit  gray  images  at  NTSC  resolution  (480  vertical  lines).  In  the 
sequences,  the  width  of  the  face  in  the  image  averages  200  pixels.  A  single  subject  is  used  in  both 
experiments  presented  here.  We  validate  the  shape  estimates  (qb)  using  a  Cyberware  range  scan  of 
the  subject,  shown  in  Figure  3. 


Figure  3:  Range  scan  of  the  subject  (shaded  and  textured) 

The entire estimation process is automatic, except for the initialization, which requires the manual specification of several landmark features in the first frame of the sequence (the eyebrow centers, eye corners, nose tip, and mouth corners). The subject must also be at rest and (approximately) facing forward. Experience has shown that the initialization process is robust to small displacements (i.e. several pixels) in the selected landmark points. Further details of this initialization process are provided in [7].

For each of the tracking examples, several frames from the image sequence are displayed, cropped appropriately. Below each, the same sequence is shown with the estimated face superimposed. Accompanying each sequence is a graph which indicates the accuracy of the shape estimate as compared to the available range scan. The graphs display the RMS error in the shape, measured at the vertices of the polygonal model. Note that this comparison includes a uniform scaling of the model so that the two faces are the same size (this eliminates the depth ambiguity; in this case, the estimated model was compared at 96% scale). As all of the approaches are initialized identically, the RMS error at the first frame is the same for all three techniques. This comparison ignores the motion parameters, so that only the shape is compared (ground truth for motion is not available).
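
One way such a scale-normalized RMS comparison might be computed is sketched below. The assumptions are ours: the two vertex sets are in one-to-one correspondence, both are expressed in a common frame so that scaling about the origin is meaningful, and the scale is chosen by least squares. The paper does not state how its scale factor was obtained, and `scaled_rms_error` is a hypothetical name.

```python
import numpy as np

def scaled_rms_error(estimated, reference):
    """RMS vertex distance after a least-squares uniform scaling of `estimated`.

    estimated, reference: (k, 3) arrays of corresponding vertices.
    Returns (rms, scale), where scale minimizes ||scale * estimated - reference||.
    """
    est = np.asarray(estimated, dtype=float)
    ref = np.asarray(reference, dtype=float)
    scale = np.sum(est * ref) / np.sum(est * est)
    diff = scale * est - ref
    # Root of the mean squared per-vertex distance.
    rms = float(np.sqrt(np.mean(np.sum(diff ** 2, axis=1))))
    return rms, float(scale)

# Toy usage: an estimate that is the reference enlarged by 1 / 0.96.
reference_pts = np.array([[1.0, 0.0, 0.0],
                          [0.0, 2.0, 0.0],
                          [0.0, 0.0, 3.0],
                          [1.0, 1.0, 1.0]])
rms_demo, scale_demo = scaled_rms_error(reference_pts / 0.96, reference_pts)
```

In this toy case the recovered scale is 0.96 and the remaining error is zero, since the two shapes differ only in size.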

The initialization process usually takes about 2 minutes of computation. Afterwards, processing each frame using the method in [7] takes approximately 1.4 seconds. The two-stage approach requires an additional 3 seconds per frame, while the residual reduction method adds an additional 5 seconds per frame (all computation times are measured on a 175 MHz R10000 SGI O2). For all three methods, 120 pixels are used in the optical flow computation, selected using the methods described in [7] (using more pixels does not significantly alter the results).

5.1  Estimation  experiments 

The shape estimation validation experiment in Figure 4 shows the subject making a series of non-rigid face motions: opening his mouth in (b) and (c), smiling in (d) through (e), and finally raising his eyebrows in (f). At each frame, Figure 5 shows the extracted shape results as compared against the range scan of the subject, for all three techniques (the labels correspond to what was given in their descriptions above).

The RMS error starts at around 1.7 cm after initialization. For the original approach [7], the error steadily declines over the course of the experiment, ending around 1.3 cm. The two-stage approach behaves similarly, but with consistently better performance than the original approach, ending around 1.1 cm. The proposed residual reduction method performs best, ending with an RMS error of 0.85 cm. In addition, most of the adjustment took place in the first half of the sequence.


(a) Frame 1  (b) Frame 9  (c) Frame 13  (d) Frame 20  (e) Frame 24  (f) Frame 32

Figure 4: Shape estimation experiment 1

Figure 5: Shape validation of experiment 1

The experiment in Figure 6 shows the subject performing small head motions in (a) through (f) while smiling in (c) and (d), and finishing with a significant head rotation in (g). This time, the RMS error starts at around 1.9 cm after initialization. The original approach shows a gradual reduction over the sequence, ending just under 1 cm, with the large reduction in error around frame 50 corresponding to when the subject turned his head significantly to the side in Figure 6(f) and (g), where the profile view contained good edge information to fit the face shape. The two-stage approach exhibited varied performance, sometimes more and sometimes less accurate than the original solution; it ends just above 1 cm. The residual reduction method finishes with less than half the RMS error of the other techniques: around 0.4 cm. Again, this lower level was reached fairly quickly, showing another advantage of using the error residual technique.


(a)  Frame  1  (b)  Frame  11  (c)  Frame  18  (d)  Frame  24  (e)  Frame  35  (f)  Frame  46  (g)  Frame  57 

Figure  6:  Shape  estimation  experiment  2 

Figure 7: Shape validation of experiment 2 (vertical axis: RMS error in cm)

In all experiments, the motion parameter values change appropriately, and at the correct times. All three techniques extracted virtually the same motion parameter values. This is not particularly surprising, as the edge information will maintain fairly accurate estimates of the motion parameters. We conclude this method is primarily of use for shape estimation. However, this does not mean that the motion parameters can be omitted in the formation of $G$, as this would mean that deviations that could be best explained with motion parameters would be incorrectly explained using shape parameters.

Figure 8: Residual magnitudes ($\|r\|_{RMS}$) during experiments

Note  that  for  these  experiments,  the  background  is  relatively  simple.  This  limitation  originates 
in  the  use  of  edges  from  [7]  for  aligning  the  model  and  the  image.  Given  the  assumption  of  small 
motions  by  our  system,  the  background  has  no  effect  on  any  of  the  flow  computations,  as  the  pixel 
selection  method  avoids  pixels  in  the  image  that  are  likely  to  be  part  of  the  background. 

5.2  Analysis  of  residuals 

The  derivation  of  the  method  using  the  residuals  in  Section  4  assumes  that  shape  error  is  the 
leading  contributor  to  the  residuals  from  the  motion  computation.  We  now  describe  our  analysis  of 
the  residuals  to  justify  this  claim  by  examining  the  residual  magnitudes  for  the  residual  reduction 
method.  First,  we  define  the  RMS  residual  magnitude  (in  pixel  intensity  units  in  the  range  [0, 1]) 
as: 


$$\|r\|_{RMS} = \sqrt{\frac{r^{T} r}{n}} \qquad (27)$$


Figure  8  shows  the  residual  magnitudes  at  the  start  and  end  of  the  two  experiments  described  earlier 
(in  the  first  two  columns).  During  both  tracking  experiments,  the  average  residual  magnitudes 
started  fairly  high  and  became  considerably  lower. 

To isolate the portion of the residuals caused by shape error, both experiments were run again; this time, the initial model shape was taken from the range scan of the subject (shape estimation was disabled in the framework). The residuals from these two experiments turned out to be fairly constant over the entire sequence; the third column of Figure 8 contains the residual magnitudes for the first image of these experiments. By performing similar experiments on many other image sequences, we estimated the standard deviation of the distracting component of the residuals ($\sigma_r$) to be 0.060.

We observe a dramatic reduction between the first and third columns, indicating that the improved shape directly resulted in a much lower residual. It is also clear that our proposed method reduces the residual to a reasonable level (this time, compare the second and third columns). This supports the validity of our assumption that parameter estimation error is responsible for the bulk of the residual.

5.3  Discussion 

Besides having improved accuracy over the original and two-stage methods, our framework extracts the shape of the face without needing data from extreme head poses (such as a profile view). Instead, fewer observations are needed to extract the shape, so that the static part of the estimation problem converges sooner.

In addition, once the static estimation problem is complete, there is no need to perform further adjustments in our application, as there is little improvement in the motion estimates. We can also skip the adjustment computation in situations where the residual is small (compared to $\sigma_r$), as it would tend to provide little useful information. For an application like face tracking, where the motion is the desired output (but still requires an accurate shape for good results), this is ideal.
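
A minimal sketch of such a skip test follows. It assumes the residuals are available as an array of pixel intensity differences in [0, 1]; the function names and the `factor` threshold multiplier are hypothetical, and the default of 0.060 is the value of the noise level estimated in Section 5.2.

```python
import numpy as np

def rms_residual(r):
    """Root-mean-square magnitude of a residual vector."""
    r = np.asarray(r, dtype=float)
    return float(np.sqrt(np.mean(np.square(r))))

def should_adjust(r, sigma_r=0.060, factor=1.0):
    """Return True when the RMS residual exceeds factor * sigma_r.

    `factor` is an illustrative tuning knob, not a value from the paper.
    """
    return rms_residual(r) > factor * sigma_r

# Toy usage with 120 residuals (one per pixel used in the flow computation).
quiet = np.full(120, 0.03)   # below the noise level: skip the adjustment
large = np.full(120, 0.20)   # well above the noise level: run the adjustment
run_quiet = should_adjust(quiet)
run_large = should_adjust(large)
```

A residual at or below the estimated noise level carries little information about shape error, so skipping the adjustment in that case saves the cost of the decomposition without changing the estimate appreciably.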

We find that performing adjustments using this method is quite robust, at least to the same level as the underlying flow computation. In other words, the proposed algorithm almost never worsens the performance. Of course, in situations where there is large optical flow violation (such as a major lighting change), both the adjustment method and the model-based optical flow computation will fail. As shown in the validation experiments in [7], the use of multiple cues can still provide robust behavior even in some of these difficult situations.


As in [7], we also tested our system by instigating failure in various ways. First, we added Gaussian noise of increasing variance directly to the images until the system failed (holding $\sigma_r$ fixed). In this case, at a noise level of 8.3% of the image intensity, the residual reduction method failed (i.e., it produced a spurious shape estimate that later caused tracking failure). At only a slightly higher noise level (9.1%), the flow method failed. Second, we added random offsets to the initial parameters $q$ (after the initialization) to see if the system could recover. This time, the flow method from the original framework failed first. We see similar behavior for sequences with large lighting changes or other violations of the optical flow constraint equation. Solutions to these problems can come on two fronts. First would be to model the situation better; for instance, using a more general form of the optical flow constraint equation to take radiometric variations into account [20]. Second, and perhaps more generally applicable, would be the use of robust techniques for cue integration, which expect some (but not all) cues or computations to fail at any time [17].
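
The noise test can be sketched as a simple sweep. This is an illustration under assumptions: images are arrays with intensities in [0, 1], the noise level is interpreted as a standard deviation expressed as a fraction of the full intensity range, and `tracker_fails` is a hypothetical callback standing in for running the tracker and judging failure.

```python
import numpy as np

def add_noise(image, level, rng):
    """Add zero-mean Gaussian noise with standard deviation `level`, clipped to [0, 1]."""
    noisy = image + rng.normal(0.0, level, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)

def first_failing_level(frames, levels, tracker_fails, seed=0):
    """Return the first noise level at which tracker_fails(noisy_frames) is True, else None."""
    rng = np.random.default_rng(seed)
    for level in levels:
        noisy_frames = [add_noise(frame, level, rng) for frame in frames]
        if tracker_fails(noisy_frames):
            return level
    return None

# Toy usage: a stand-in "tracker" that fails once the image noise is large.
demo_frames = [np.full((64, 64), 0.5)]

def demo_fails(frames):
    return float(np.std(frames[0])) > 0.07

demo_level = first_failing_level(demo_frames, [0.02, 0.05, 0.083, 0.091], demo_fails)
```

Running the same sweep once per method, with the same seed, gives comparable failure levels for the flow computation and for the adjustment.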

6  Conclusions 

We  have  presented  a  novel  deformable  model  technique  which  uses  residuals  from  a  model-based 
optical  flow  solution  to  refine  the  shape  of  the  model.  By  using  the  relationship  between  the  shape 
and  motion  parameterizations,  small  improvements  to  the  parameters  are  made  by  minimizing  the 
model-based  optical  flow  residuals.  It  was  the  separation  of  the  parameterization  which  made 
this  computation  possible,  since  additional  parameters  that  did  not  apply  in  the  model-based  flow 
computations  could  still  be  adjusted.  While  this  method  is  presented  in  the  context  of  face  tracking, 
it  could  certainly  be  applied  in  other  model-based  domains  with  separable  parameterizations. 

The adjustment computation, along with the cue integration methods from [7], seems to be fairly robust to small optical flow constraint equation violations or approximations. Besides having greater accuracy than a framework using only optical flow and edges, our framework extracts the shape of the face without needing data from extreme head poses (such as a profile view). Instead, much smaller subject motions are required to extract the shape information.